Text Analysis of Biden and Trump Speeches During the 2020 Presidential Election

Introduction

The United States presidential election is one of the most closely followed political events in the world. As such, many people study the data involved, hoping both to make predictions and to inform the public about the current state of the race. In this blog post, we share our analysis of data from a key component of the election process: speeches given by the candidates. In particular, we analyzed text from speeches given by Joe Biden and Donald Trump in the lead-up to the 2020 election. The primary questions we hoped to answer were:

  1. What are the most common words and phrases used by Trump and Biden?
  2. What are the relationships between those words/phrases?
  3. How did the frequency of these words/phrases change over time?

We answered these questions through a series of visualizations which display results acquired via various techniques in text analysis.

Data

Visualizations

To address our three posed questions, we created three types of visualizations, one for each question. To identify the most frequent words used in the speeches, we created word clouds with font size corresponding to word frequency. To identify relationships between the words, we created network graphs with edge widths corresponding to the “closeness” of those words within the documents. (We define “closeness” in the network section.) Lastly, we created line graphs to show how word frequencies changed over time.

Word Frequency Wordclouds

The word clouds highlight the most frequently used words in Donald Trump’s and Joe Biden’s speeches. We made use of stop words to remove words that occur frequently but provide little information; common English stop words include “I”, “she’ll”, and “the”. We created a vector of our own stop words and added it to the built-in stop words data frame. Analyzing the most frequent words used by the two candidates gives insight into the main issues they hope to address and their main policies. For example, a frequent word in Joe Biden’s speeches is “covid”, since one of his main campaign policies was the eradication of the virus in the US, and “china” was one of Trump’s frequent words, since China is the US’s foreign trade rival.
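The stop word filtering step can be sketched in Python as follows. This is a minimal illustration, not our exact pipeline; the stop word sets below are placeholders standing in for the built-in list plus our own additions.

```python
from collections import Counter
import re

# Illustrative stop word sets; the real analysis used a built-in stop word
# list plus a vector of our own additions.
STOP_WORDS = {"i", "she'll", "the", "a", "and", "of", "to", "is"}
CUSTOM_STOP_WORDS = {"applause", "crowd"}

def word_frequencies(text):
    """Count words after stripping punctuation and removing stop words."""
    words = re.sub(r"[^a-zA-Z' ]", "", text.lower()).split()
    stop = STOP_WORDS | CUSTOM_STOP_WORDS
    return Counter(w for w in words if w not in stop)

freqs = word_frequencies("The people of the nation and the people applause")
# 'people' is counted twice; stop words and our custom additions are dropped
```

The resulting counts are exactly what the word cloud needs: each remaining word, sized by its frequency.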

Donald Trump WordCloud

Joe Biden WordCloud

Network Visualizations

To understand the relationships between speech words, we looked at two sets of words: popular election topics such as climate change, health care, and COVID-19; and the most common words across all speeches. For each of these sets, we needed to define some metric of “closeness”. To do this, we emulated a network analysis of character co-occurrences in Game of Thrones. Specifically, we defined the closeness of two words (within the full dataset) as the number of times the words occur within d words of each other in a single speech, where d is a parameter specifying this word distance. For the first set (the popular election topics), we chose d to be 50, a relatively high value: these words are more specific in nature than those in the other set, so a higher choice of d gives us enough data to observe significant relationships. On the other hand, we chose d to be 10 for the most common words. Since the words in this set occur much more frequently than the popular election topics, a smaller value of d was needed to refine the selection and avoid noisy data.

Text Mining

To mine the data in this way, we wrote a Python script that iterates over the speeches and their words. For each speech, we iterated over all words, and for each relevant word, we collected the relevant nearby words (those relevant words within d positions of the word being considered). Some sample code is provided:

import re
from collections import defaultdict

output = []  # each element is one line of the CSV output
output.append("word1,word2,weight")  # CSV header
output_dictionary = defaultdict(int)  # maps frozenset (of two strings) to int

for input_file in input_files:  # these are the speeches
  with open(input_file) as f:
    text = f.read()

  # strip everything except letters, apostrophes, and spaces
  regex = re.compile(r"[^a-zA-Z' ]")
  text = regex.sub('', text)

  words = text.lower().split()
  for i in range(len(words)):
    word = words[i]
    if word not in top_words:  # top_words is the set of relevant words to be considered
      continue

    # acquire nearby words: relevant words within `dist` positions of index i
    nearby_words = []
    for j in range(i - dist, i + dist + 1):
      if 0 <= j < len(words) and j != i:  # j is a valid index other than i
        if words[j] in top_words and words[j] != word:
          nearby_words.append(words[j])

    # increase weights accordingly
    for nearby_word in nearby_words:
      # frozensets are hashable and unordered, so (a, b) and (b, a) map to one key
      to_hash = frozenset([word, nearby_word])
      output_dictionary[to_hash] += 1

# write out the accumulated weights once, after all speeches are processed
for pair_set in output_dictionary:
  pair_list = list(pair_set)
  weight = output_dictionary[pair_set]
  output.append(pair_list[0] + ',' + pair_list[1] + ',' + str(weight))

output = '\n'.join(output)

Note that we double count connections between words, but this does not matter for our purposes: since every connection is double counted, the relative comparisons between relationships are unaffected.
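Once the edge list is built, a word’s centrality in the network can be summarized by its weighted degree, the sum of the weights of the edges touching it. A minimal sketch, using illustrative edge data rather than our real output:

```python
from collections import defaultdict

# Illustrative edges in the word1,word2,weight format produced above
edges = [
    ("people", "country", 12),
    ("people", "covid", 8),
    ("china", "trade", 5),
]

# Weighted degree: the sum of edge weights touching each word. A word's
# weighted degree is one simple proxy for how central it is in the network.
weighted_degree = defaultdict(int)
for w1, w2, weight in edges:
    weighted_degree[w1] += weight
    weighted_degree[w2] += weight

# here 'people' has the highest weighted degree (12 + 8 = 20)
```

In our networks, this kind of summary is what makes the centrality of a word like ‘people’ visible at a glance.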

Network Analysis for Most Commonly Used Words

Our second set of networks focuses on the most commonly used words across all speeches. The same procedure was used as above.

One observation from these networks is how central the word ‘people’ is to both candidates’ speeches. It is the most commonly used word by both candidates, so it makes sense for it to be central in these networks. Furthermore, Biden’s use of common words is more focused, with nearly all of the strongest edges containing the word ‘people’. Trump’s use of these words, on the other hand, is more spread out.

Speech Analysis Over Time
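The per-speech frequencies behind the line graphs can be computed as in the sketch below. The speeches and dates here are illustrative placeholders; in practice each speech file carries a date, and the relative frequency of a target word is computed per speech.

```python
from collections import Counter

# Illustrative (date, text) pairs; real speeches would be loaded from files
speeches = [
    ("2020-08-01", "covid covid economy people"),
    ("2020-09-15", "economy people people jobs"),
    ("2020-10-30", "covid people jobs jobs jobs"),
]

def frequency_over_time(speeches, target_word):
    """Return (date, relative frequency of target_word) for each speech."""
    series = []
    for date, text in speeches:
        counts = Counter(text.split())
        total = sum(counts.values())
        series.append((date, counts[target_word] / total))
    return series

trend = frequency_over_time(speeches, "covid")
# one point per speech, ready to plot as a line graph over time
```

Using relative rather than raw frequency keeps long and short speeches comparable on the same axis.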

Limitations, Pitfalls, and Future Research

While we feel we adequately answered our specific questions, it is difficult to draw inferences beyond them about larger questions, such as how the candidates’ different speech patterns contributed to the outcome of the 2020 election. One interesting direction for future research is sentiment analysis of the speech text, to determine the tone of the language used and how it shifted over time.
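As a minimal illustration of the lexicon-based sentiment idea, one could average per-word scores over a speech. The tiny lexicon below is a placeholder; a real analysis would use an established lexicon such as VADER or AFINN.

```python
# Toy sentiment lexicon; a real analysis would use VADER, AFINN, etc.
LEXICON = {"great": 1, "win": 1, "hope": 1, "crisis": -1, "fail": -1}

def sentiment_score(text):
    """Average lexicon score of the words in a speech (0 if none match)."""
    hits = [LEXICON[w] for w in text.lower().split() if w in LEXICON]
    return sum(hits) / len(hits) if hits else 0.0

score = sentiment_score("We will win and there is great hope despite the crisis")
# three positive words and one negative word average to 0.5
```

Computing such a score per speech and plotting it by date would extend our line-graph approach from word frequency to tone.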